Apparatus, systems, and methods for batch and realtime data processing

ABSTRACT

A traditional data processing system is configured to process input data either in batch or in real-time. On one hand, a batch data processing system is limiting because the batch data processing often cannot take into account any data received during the batch data processing. On the other hand, a real-time data processing system is limiting because the real-time system often cannot scale. The real-time data processing system is often limited to dealing with primitive data types and/or a small amount of data. Therefore, it is desirable to address the limitations of the batch data processing system and the real-time data processing system by combining the benefits of the batch data processing system and the real-time data processing system into a single data processing system.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims benefit of the earlier filing date, under 35 U.S.C. § 119(e), of

-   U.S. Provisional Application No. 61/799,986, filed on Mar. 15, 2013, entitled “SYSTEM FOR ANALYZING AND USING LOCATION BASED BEHAVIOR”;
-   U.S. Provisional Application No. 61/800,036, filed on Mar. 15, 2013, entitled “GEOGRAPHIC LOCATION DESCRIPTOR AND LINKER”;
-   U.S. Provisional Application No. 61/799,131, filed on Mar. 15, 2013, entitled “SYSTEM AND METHOD FOR CROWD SOURCING DOMAIN SPECIFIC INTELLIGENCE”;
-   U.S. Provisional Application No. 61/799,846, filed on Mar. 15, 2013, entitled “SYSTEM WITH BATCH AND REAL TIME DATA PROCESSING”; and
-   U.S. Provisional Application No. 61/799,817, filed on Mar. 15, 2013, entitled “SYSTEM FOR ASSIGNING SCORES TO LOCATION ENTITIES”.

This application is also related to:

-   U.S. patent application Ser. No. ______, entitled “APPARATUS, SYSTEMS, AND METHODS FOR ANALYZING MOVEMENTS OF TARGET ENTITIES,” identified by the Attorney Docket Number 2203957-00123US2, filed on the even-date herewith;
-   U.S. patent application Ser. No. ______, entitled “APPARATUS, SYSTEMS, AND METHODS FOR PROVIDING LOCATION INFORMATION,” identified by the Attorney Docket Number 2203957-00124US2, filed on the even-date herewith;
-   U.S. patent application Ser. No. ______, entitled “APPARATUS, SYSTEMS, AND METHODS FOR CROWDSOURCING DOMAIN SPECIFIC INTELLIGENCE,” identified by the Attorney Docket Number 2203957-00125US2, filed on the even-date herewith;
-   U.S. patent application Ser. No. ______, entitled “APPARATUS, SYSTEMS, AND METHODS FOR ANALYZING CHARACTERISTICS OF ENTITIES OF INTEREST,” identified by the Attorney Docket Number 2203957-00127US2, filed on the even-date herewith; and
-   U.S. patent application Ser. No. ______, entitled “APPARATUS, SYSTEMS, AND METHODS FOR GROUPING DATA RECORDS,” identified by the Attorney Docket Number 2203957-00129US1, filed on the even-date herewith.

The entire content of each of the above-referenced applications (including both the provisional applications and the non-provisional applications) is herein incorporated by reference.

FIELD OF THE INVENTION

The present disclosure generally relates to data processing systems, and specifically, to data processing systems that can process data using batch processing and real-time processing.

BACKGROUND

The system disclosed herein relates to receiving, processing and storing data from many sources, representing the most “correct” summary of facts and opinions from the data, including being able to re-compute this in real-time, and then using the results to respond to queries. As an example, when a user inputs a query to a web-based system, mobile phone, or vehicle navigation system searching for a “child friendly Chinese restaurant in Greenwich Village that has valet parking”, the system can very quickly respond with a list of restaurants matching, for example, the attributes: {“kid_friendly”:true, “category”:“Restaurant>Chinese”, “valet_parking”:true, “neighborhood”:“Greenwich Village”}. A mobile phone may then provide a button to call each restaurant. The information describing each restaurant may be spread across many websites, sourced from many data stores, and provided directly by users of the system.

A problem in the art is that all web pages, references, and data about all known businesses in the United States stored in any data store can be so large as to not be understandable and query-able in real-time. Updating and maintaining such a large amount of information can be difficult. For example, information describing businesses in the United States comprises billions of rows of input data, tens of billions of facts, and tens of terabytes of web content.

At the same time, new information is continuously becoming available and it is desirable to include such information in the production of query results. As an example, the system may learn that a restaurant no longer offers valet parking, that the restaurant disallows children, or that the restaurant's phone number has been disconnected.

Accordingly, it is desirable to be able to update a system that produces search results both on an ongoing basis (e.g., to account for newly written reviews) as well as on a wholesale basis (e.g., to reevaluate the entire data and use information contained within that may have been previously unusable).

SUMMARY OF THE DISCLOSED SUBJECT MATTER

The disclosed system is configured to receive new data (e.g., real-time updates) and old data (e.g., data that has been gathered over a long period of time) and is configured to periodically update the system based on a reevaluation of the new data and the old data.

Unlike systems that can only search web pages and return links to matching web pages, the disclosed system can maintain attribute information about each entity and return the information directly. In conventional systems, a web search for “Chinese restaurant with valet parking”, for example, can return links to web pages with the words “Chinese”, “restaurant”, “valet”, and “parking”. This will generally include pages with statements like “there was no valet parking to be found” because the words “valet” and “parking” appeared in the text and were therefore indexed as keywords for the web page. By contrast, the disclosed system has attributes such as the category of the restaurant, and a value indicating whether the restaurant offers valet parking, which advantageously allow the system to respond with more meaningful results. Additionally, in the disclosed system, a user can operate as a contributor to correct the data. Also, the disclosed system can interpret facts across many web pages and arrive at a consensus answer, which is then query-able to further improve results.

In embodiments of the disclosed system, a user can operate as a contributor that contributes data to the system. For example, a user may provide direct feedback to the system to correct a fact, such as a restaurant's phone number. Multiple such submissions can be considered together by the disclosed system along with information on websites such as blogs about child friendly restaurants and summarized into a rapidly evolving data store that can quickly respond to queries. Therefore, users of the disclosed system can access the restaurant's newly corrected phone number and a more accurate assessment of its child-friendliness.

In some embodiments, the disclosed system can improve or expand the analytic methods for understanding information on web pages or feedback. The analytic methods can be improved or expanded on an ongoing basis. For example, today's methods may be capable of extracting more information from a data source compared to last month's methods. If one web page includes facts as simple text while another has opinions in complex prose, the disclosed system, using last month's methods, may have been able to process simple text data such as “Valet Parking: yes” but have been unable to process prose such as “There was no place to park, not even valet.” However, the disclosed system, using today's methods, may have expanded capability and be able to process the more nuanced prose data.

In general, in an aspect, embodiments of the disclosed subject matter can include a computing system for generating a summary data of a set of data. The computing system can include one or more processors configured to run one or more modules stored in a non-transitory computer readable medium. The one or more modules are operable to receive a first set of data and a second set of data, wherein the first set of data comprises a larger number of data items compared to the second set of data, process the first set of data to format the first set of data into a first structured set of data, generate a first summary data using the first structured set of data by operating rules for summarizing the first structured set of data, and store the first summary data in a data store, process the second set of data to format the second set of data into a second structured set of data, generate a second summary data based on the first structured set of data and the second structured set of data by operating rules for summarizing the first structured set of data and the second structured set of data, determine a difference between the first summary data and the second summary data, and update the data store based on the difference between the first summary data and the second summary data.

In general, in an aspect, embodiments of the disclosed subject matter can include a method for generating a summary data of a set of data. The method can include receiving, at an input module operating on a processor of a computing system, a first set of data and a second set of data, wherein the first set of data comprises a larger number of data items compared to the second set of data, processing, at a first input processing module of the computing system, the first set of data to format the first set of data into a first structured set of data, generating, at a first summary generation module of the computing system, a first summary data using the first structured set of data by operating rules for summarizing the first structured set of data, maintaining the first summary data in a data store in the computing system, processing, at a second input processing module of the computing system, the second set of data to format the second set of data into a second structured set of data, generating, at a second summary generation module of the computing system, a second summary data using the first structured set of data and the second structured set of data by operating rules for summarizing the first structured set of data and the second structured set of data, determining, at a difference generation module of the computing system, a difference between the first summary data and the second summary data, and updating, by the computing system, the data store based on the difference between the first summary data and the second summary data.

In general, in an aspect, embodiments of the disclosed subject matter can include a computer program product, tangibly embodied in a non-transitory computer-readable storage medium. The computer program product includes instructions operable to cause a data processing system to receive a first set of data and a second set of data, wherein the first set of data comprises a larger number of data items compared to the second set of data, process the first set of data to format the first set of data into a first structured set of data, generate a first summary data using the first structured set of data by operating rules for summarizing the first structured set of data, and store the first summary data in a data store, process the second set of data to format the second set of data into a second structured set of data, generate a second summary data using the first structured set of data and the second structured set of data by operating rules for summarizing the first structured set of data and the second structured set of data, determine a difference between the first summary data and the second summary data, and update the data store based on the difference between the first summary data and the second summary data.

In any one of the embodiments disclosed herein, the second set of data comprises real-time data submissions, and the one or more modules are operable to process the second set of data to format the second set of data into the second structured set of data in response to receiving the second set of data.

In any one of the embodiments disclosed herein, the computing system, the method, or the computer program product can include modules, steps, or executable instructions for processing the first set of data to format the first set of data into the first structured set of data at a first time interval, which is substantially longer than a second time interval at which the second set of data is formatted into the second structured set of data.

In any one of the embodiments disclosed herein, each of the first summary data and the second summary data comprises an entity identifier and a value associated with the entity identifier, and wherein the computing system, the method, or the computer program product can further include modules, steps, or executable instructions for determining the difference between the first summary data and the second summary data by determining that the first summary data and the second summary data include an identical entity identifier, and comparing values associated with the identical entity identifiers in the first summary data and the second summary data.

In any one of the embodiments disclosed herein, the computing system, the method, or the computer program product can include modules, steps, or executable instructions for providing the difference between the first summary data and the second summary data to other authorized computing systems.

In any one of the embodiments disclosed herein, the computing system, the method, or the computer program product can include modules, steps, or executable instructions for providing the difference to other authorized computing systems via an application programming interface.

In any one of the embodiments disclosed herein, the computing system, the method, or the computer program product can include modules, steps, or executable instructions for providing the difference to other authorized computing systems as a file.

In any one of the embodiments disclosed herein, the computing system, the method, or the computer program product can include modules, steps, or executable instructions for combining at least the first set of data and the second set of data to generate a third set of data, processing the third set of data to format the third set of data into a third structured set of data based on new rules for formatting a set of data, and generating a third summary data using the third structured set of data.

In any one of the embodiments disclosed herein, the first set of data and the third set of data each includes a first data element, and wherein the first data element is associated with a first entity in the first summary data identified by the first entity identifier, wherein the first data element is associated with a second entity in the third summary data, and wherein the computing system, the method, or the computer program product can further include modules, steps, or executable instructions for associating the first entity identifier to the second entity in the third summary data so that the first data element maintains its association with the first entity identifier in the third summary data.

In any one of the embodiments disclosed herein, the first structured set of data comprises a grouping of data items based on an entity identifier associated with the data items.

In any one of the embodiments disclosed herein, the computing system comprises at least one server in a data center.

In any one of the embodiments disclosed herein, the data store comprises a plurality of data store systems, each of which is associated with a view, and wherein the one or more modules are operable to select one of the plurality of data store systems in response to a query based on the view associated with the query.

In any one of the embodiments disclosed herein, the computing system, the method, or the computer program product can include modules, steps, or executable instructions for identifying a third set of data received after the generation of the second summary data, generating a third summary data based on the third set of data, the first structured set of data, and the second structured set of data by operating rules for summarizing the first structured set of data, the second structured set of data, and the third summary data, determining a difference between the second summary data and the third summary data, and updating the data store based on the difference between the second summary data and the third summary data.

DESCRIPTION OF THE FIGURES

Various objects, features, and advantages of the present disclosure can be more fully appreciated with reference to the following detailed description when considered in connection with the following drawings, in which like reference numerals identify like elements. The following drawings are for the purpose of illustration only and are not intended to be limiting of the disclosed subject matter, the scope of which is set forth in the claims that follow.

FIG. 1 illustrates the common processing framework of the disclosed system in accordance with some embodiments.

FIGS. 2A-2C illustrate enlarged views of portions of FIG. 1.

FIG. 3 illustrates the Catchup process in accordance with some embodiments.

DETAILED DESCRIPTIONS

A traditional data processing system is configured to process input data either in batch or in real-time. On one hand, a batch data processing system is limiting because the batch data processing cannot take into account any additional data received during the batch data processing. On the other hand, a real-time data processing system is limiting because the real-time system cannot scale. The real-time data processing system is often limited to dealing with primitive data types and/or a small amount of data. Therefore, it is desirable to address the limitations of the batch data processing system and the real-time data processing system by combining the benefits of the batch data processing system and the real-time data processing system into a single system.

It is hard for a single system to accommodate both real-time processing and batch processing because the data and/or processes for the two are quite different. For example, in a batch processing system, a program cannot access the data processing result until the entire data process is complete, whereas in a real-time processing system, a program can access the processing result during the data processing.

The disclosed data processing apparatus, systems, and methods can address the challenges in integrating the batch data processing system and the real-time processing system.

Some embodiments of the disclosed system can be configured to process unstructured data and convert the unstructured data into a summary data. The summary data can be stored in one or more data stores (including, for example, one or more data storages and/or one or more databases) or in one or more search servers. The summary data can be formatted and optionally indexed to be query-able using the one or more data stores or search servers, or by a third party user through an application programming interface (API).

The summary data can include one or more unique entities and at least one attribute about those entities. One of the attributes about an entity can be an entity identifier that is unique amongst the entities. Additional attributes describe some properties of the entity such as a Boolean value (e.g., whether restaurant A has valet parking is “True” or “False”), an integer, a string, a set of characters, binary data (e.g., bytes representing an image), or arrays or sets of these types, or any other combinations thereof.

Some embodiments of the disclosed system can be configured to generate the summary data based on two types of data inputs: a bulk data input and an intermittent data input. The bulk data input can refer to a large amount of data that has been gathered over time. In some cases, the bulk data input can refer to all data that the disclosed system has received over a predetermined period of time, which can be long. For example, the bulk data input can include raw information received from multiple contributors or from a web-crawler over a long period of time. In some embodiments, the bulk data can be maintained in the disclosed system itself; in other embodiments, the bulk data can be received from another storage center via a communication interface. The intermittent data input can include a small amount of data that is provided to the disclosed system. The intermittent data input can include, for example, real-time data submissions from contributors.

Some embodiments of the disclosed system can be configured to process both types of data inputs using a common processing framework. The common processing framework can include a real-time system that can respond to the intermittent data input (e.g., small incremental contributions from contributors) and reflect changes based on those contributions, considered together with data from the batch system, in the summary data in substantially real-time. The common processing framework can also include a batch system that can process the bulk data input. The batch system can be configured to format the bulk data to be amenable for further processing, and use the formatted bulk data to generate summary data.

In some embodiments, the batch processing system is configured to generate summary data by formatting the unstructured data in the bulk data inputs into a structured data. Then the batch system is configured to group the elements in the structured data and generate a representative identifier for each group, also referred to as an entity. The batch system can then generate an identifier for each entity and calculate attribute values describing each entity.

For example, when the large bulk data input includes 5 data elements associated with an existence of valet parking at a restaurant A, then the batch system can determine that those 5 data elements belong to the same entity (e.g., restaurant A), and consolidate information associated with the 5 data elements. For instance, if 3 data elements indicate that the restaurant A has valet parking and 2 data elements indicate that the restaurant A does not have valet parking, then the batch system can consolidate the 5 elements and indicate that the attribute “valet parking” for the entity “restaurant A” is “True.” This consolidation process is, in some ways, similar to the process disclosed in U.S. Patent Application Publication No. 2011/0066605, entitled “PROCESSES AND SYSTEMS FOR COLLABORATIVE MANIPULATION OF DATA,” filed on Sep. 15, 2009, which is herein incorporated by reference in its entirety.
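The consolidation in the valet parking example can be sketched as a simple vote over the data elements grouped under one entity. This is a minimal illustration only: it assumes every data element is trusted equally and that the most frequent value wins, and the function name `consolidate` is hypothetical rather than part of the disclosed system.

```python
from collections import Counter

def consolidate(values):
    """Return the most frequently reported value for one attribute of one entity.

    Assumption: every data element carries equal weight (simple plurality vote).
    """
    counts = Counter(values)
    return counts.most_common(1)[0][0]

# Five data elements about valet parking at restaurant A:
# three say it is offered, two say it is not.
valet_parking_elements = [True, True, True, False, False]
valet_parking = consolidate(valet_parking_elements)
```

With three of the five elements reporting valet parking, `valet_parking` evaluates to `True`, matching the consolidated attribute described above.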

In some embodiments, an entity is a distinct object that the system elects to track. For example, the system can consider each physical restaurant (e.g., a chain with multiple locations would have an entity for each location) as a separate entity (i.e., summary record) when the system receives reviews about each physical restaurant as data input. Similarly, when toothpaste from a particular brand comes in 3 sizes and 4 flavors for each size, the system can maintain 12 distinct entities for the toothpaste.

In some embodiments, the real-time system can be configured to update the summary data generated by the batch system as the real-time system receives intermittent data inputs from contributors. For example, if the real-time system receives two additional data inputs from the contributors, both indicating that the restaurant A does not have valet parking, then 4 of the 7 data elements indicate no valet parking, and the real-time system can update the summary data to indicate that the attribute “valet parking” for entity “restaurant A” is “False.”

In some embodiments, the real-time system can be configured to leverage the structured data generated by the batch system. For example, when the real-time system receives intermittent data input from a contributor, the real-time system can consolidate the intermittent data input with the structured bulk data generated by the batch system. This way, the amount of computation required by the real-time system can be reduced.
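One way to picture how the real-time system can reuse the batch system's structured data is to keep the per-value tallies produced by the batch run and fold each intermittent input into them, rather than reprocessing the bulk data. The sketch below is an assumption-laden illustration: the class `AttributeTally`, its methods, and the equal-weight vote are hypothetical and are not taken from the disclosure.

```python
from collections import Counter

class AttributeTally:
    """Running vote counts for one attribute of one entity (hypothetical helper)."""

    def __init__(self, structured_batch_values):
        # Computed once by the batch system from the structured bulk data.
        self.counts = Counter(structured_batch_values)

    def summary(self):
        # Assumption: the most frequently reported value is the summary value.
        return self.counts.most_common(1)[0][0]

    def add_intermittent(self, value):
        # Real-time path: a single counter increment, no bulk reprocessing.
        self.counts[value] += 1
        return self.summary()

# Batch result for "valet parking" at restaurant A: 3 True, 2 False.
tally = AttributeTally([True, True, True, False, False])
summary_before = tally.summary()

# Two contributors report in real time that there is no valet parking.
tally.add_intermittent(False)
summary_after = tally.add_intermittent(False)
```

After the two real-time inputs the counts are 3 for `True` and 4 for `False`, so the summary flips from `True` to `False` without touching the original bulk inputs.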

In some embodiments, the batch system can be configured to run periodically, with a predetermined period. The batch system can be operated less frequently compared to the real-time system since the amount of computation needed by the batch system is considerably larger compared to the amount of computation needed by the real-time system. For example, the batch system can be operated so as to update the system on schedules like once an hour, once a week, or once a month. The real-time system can be operated more frequently than the batch system. For example, the real-time system can be configured to operate whenever the real-time system receives an intermittent data input, or on inputs buffered over a short time frame such as 5 seconds or 5 minutes. The batch system can be updated with new intelligence and rules over time and can process new data provided at a scale that is beyond the capacity of the real-time system.

FIG. 1 illustrates the common processing framework of the disclosed system in accordance with some embodiments. FIGS. 2A-2C illustrate enlarged views of portions of FIG. 1. The top portion of FIG. 1 illustrates the processing performed by the real-time data system, whereas the bottom portion of FIG. 1 illustrates the processing performed by the batch system.

Data Lifecycle

One aspect of the disclosed data processing system is the data lifecycle. Data can be categorized as one of the following types as the data progresses through the disclosed data processing system: Raw/Unprocessed Data 100 (see FIG. 2A), Unprocessed (Raw) Inputs 350 (see FIG. 2B), QuickProcessed Inputs 150 (see FIG. 2B), QuickProcessed Summaries 190 (see FIG. 2C), FullProcessed Inputs 360 (see FIG. 2B), and FullProcessed Summaries 760 (see FIG. 2C). These data can be stored in a non-transitory computer readable medium. The non-transitory computer readable medium can include one or more of a hard disk, a flash memory device, a dynamic random access memory (DRAM), a static random access memory (SRAM), or any combinations thereof.

Raw/Unprocessed Data

Raw/Unprocessed Data 100 is data that is in a raw/unprocessed form. For example, a web page about a restaurant might say somewhere in the content of the webpage “has valet parking.” In this case, the raw data is a copy of the entire web page. In the disclosed system, an input of {“valet_parking”:true} could, for example, originate from a webpage that said “has valet parking.” As an additional example, the system may contain a data store of restaurants, for example, a data storage having restaurant-related data and/or a database having restaurant-related data. Examples of unprocessed data can include:

-   The Internet home pages of restaurants in the data store
-   Reviews of the restaurants that appear in on-line blogs
-   Reviews of the restaurants provided by individuals hired to provide data for the system
-   On-line articles about restaurants in the data store

In the disclosed system, Raw/Unprocessed Data 100 can be maintained for reprocessing (e.g., the raw data may be stored at periodic intervals so that it can be used for new runs of the batch processing system). This is advantageous because new rules, which may be developed at a later time, may be able to extract additional inputs when the raw data is reprocessed. For example, a website may have text saying “the valet scratched my car while parking it.” Even if an earlier run that evaluated content on the website failed to form any inputs about valet parking, a subsequent run may extract an input of “valet_parking”:true. Because the disclosed system can store the raw data 100 for reprocessing, a batch process of the disclosed system can be rerun against raw data 100, and a new rule that understands the more complex statement could, for example, extract an input of “valet_parking”:true on a subsequent run.
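The benefit of retaining raw data can be illustrated by running two generations of extraction rules over the same stored text. The rule functions `rules_v1` and `rules_v2` below are hypothetical stand-ins invented for this sketch; the second rule is a deliberately naive keyword pattern that does not handle negation, whereas the disclosure contemplates more capable analysis.

```python
import re

def rules_v1(raw_text):
    """Earlier rule set (hypothetical): only recognizes the literal phrase."""
    if "has valet parking" in raw_text.lower():
        return {"valet_parking": True}
    return {}

def rules_v2(raw_text):
    """Later rule set (hypothetical): also accepts 'valet' followed by 'park...'.

    Naive on purpose: it would misread negated statements.
    """
    inputs = rules_v1(raw_text)
    if not inputs and re.search(r"\bvalet\b.*\bpark", raw_text.lower()):
        inputs = {"valet_parking": True}
    return inputs

# Raw data retained from an earlier crawl.
raw_pages = ["The valet scratched my car while parking it."]

first_run = [rules_v1(page) for page in raw_pages]   # earlier batch run
second_run = [rules_v2(page) for page in raw_pages]  # rerun with newer rules
```

Because the raw page is still available, the second run can produce an input that the first run missed, without the page having to be fetched again.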

Unprocessed (Raw) Inputs

Unprocessed Inputs 350 represent the original attribute values for an entity as they were received by the disclosed system from a contributor, third party system, web page, and/or any other suitable source of information. For example, if a webpage stated somewhere in the content of the webpage “has valet parking,” the statement “has valet parking” is an Unprocessed Input. Likewise, a website about a clothing store (raw data) might contain the statement “50 percent off sale” (raw input). As another example, a contribution, from a contributor, updating the address of a business may contain “1801 ave of stars, los angeles”. Initially, the rules available when the data was first provided may have caused this input to be ignored because the address is insufficient. However, a subsequent build with improved rules could refine it to be {“address”:“1801 Avenue of the Stars”, “city”:“Los Angeles”, “state”:“CA”, “zipcode”:“90067”}.

Unprocessed Inputs 350 may, for example, be stored in one or more of the following: a file system, including a distributed file system, or a data store, such as a relational or non-relational (i.e., NoSQL) database.

FullProcessed Inputs and Summaries

FullProcessed Inputs 360 are Inputs that have been processed in the most recent Batch Data Build. For example, if the raw input “has valet parking” were contained in an online restaurant review, the Batch Data Build of the disclosed system could extract a processed input of “valet_parking”:true.

In some embodiments, each Batch Data Build may entirely replace the previous set of FullProcessed Inputs 360. For example, suppose the disclosed system had entries for just one restaurant called “Joe's” and five websites provided facts about the restaurant. Two websites might state that the type of food served is “Chinese”. One website might state that it's “Cantonese”. Another two websites might say that it is “Italian”. In this example, FullProcessed Inputs could include {“id”:“1”, “name”:“Joe's”, “cuisine”:“Chinese”, “source”:“website1”}, {“id”:“1”, “name”:“Joe's”, “cuisine”:“Chinese”, “source”:“website2”}, {“id”:“1”, “name”:“Joe's”, “cuisine”:“Cantonese”, “source”:“website3”}, {“id”:“1”, “name”:“Joe's”, “cuisine”:“Italian”, “source”:“website4”}, {“id”:“1”, “name”:“Joe's”, “cuisine”:“Italian”, “source”:“website5”}. Based on the current rules, the FullProcessed Summary 760 may have {“id”:“1”, “name”:“Joe's”, “cuisine”:“Italian”} because it trusted all contributions equally and “Italian” and “Chinese” were tied while “Cantonese” was treated as an independent cuisine. In this example, a rule could be improved to determine that “Cantonese” is a type of “Chinese” cuisine and is also more specific, resulting in a FullProcessed Summary 760 of

-   {“id”:“1”, “name”:“Joe's”, “cuisine”:“Chinese>Cantonese”}

when the Batch Data Build is run. In the disclosed system, FullProcessed Inputs 360 and FullProcessed Summaries 760 can change when the entire table containing them is replaced while new incremental information is written to QuickProcessed tables.
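The improved cuisine rule can be sketched as a vote that first rolls each reported value up to its parent category and then reports the most specific value seen under the winning category. The taxonomy `PARENT`, the function `summarize_cuisine`, and the equal weighting of sources are assumptions made for this illustration only.

```python
from collections import Counter

# Hypothetical taxonomy: maps a specific cuisine to its broader category.
PARENT = {"Cantonese": "Chinese"}

def summarize_cuisine(values):
    """Pick the winning top-level cuisine, then append its most common subtype."""
    # Count votes per top-level category ("Cantonese" counts toward "Chinese").
    root_counts = Counter(PARENT.get(v, v) for v in values)
    root = root_counts.most_common(1)[0][0]
    # Among inputs that named a subtype of the winner, take the most common one.
    subtypes = Counter(v for v in values if PARENT.get(v) == root)
    if subtypes:
        return f"{root}>{subtypes.most_common(1)[0][0]}"
    return root

# Cuisine values from the five FullProcessed Inputs for "Joe's".
cuisines = ["Chinese", "Chinese", "Cantonese", "Italian", "Italian"]
summary = {"id": "1", "name": "Joe's", "cuisine": summarize_cuisine(cuisines)}
```

Once “Cantonese” counts toward “Chinese”, the tally is 3 to 2 and the tie with “Italian” disappears, giving the cuisine value `Chinese>Cantonese`.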

QuickProcessed Inputs and Summaries

At the start of a Batch Data Build, QuickProcessed Inputs 150 and QuickProcessed Summaries 190, which represent the newly computed Inputs and Summaries since the start of the previous Batch Data Build, may be set aside or discarded, and an empty version of each of QuickProcessed Inputs 150 and QuickProcessed Summaries 190 may be allocated in the data store, such as a database. For example, a user on a mobile device might notice that “Joe's” restaurant is miscategorized as “Italian”. That user, acting as a contributor, could submit a correction through software on her mobile device that sends the data to the Public application programming interface (API) (FIG. 1, 130). That contributor's input could look like {“id”:“1”, “cuisine”:“Chinese”}. Once that input is processed, it could be saved to QuickProcessed Inputs 150 and the entry for “Joe's” could be re-Summarized. In this example, the new Summary for “Joe's” would then be {“id”:“1”, “name”:“Joe's”, “cuisine”:“Chinese”} and because it is different from the previous FullProcessed Summary 760, the new Summary would be saved to QuickProcessed Summaries 190. In the disclosed system, when determining the latest Summary for an entity, the system can check for the latest Summary in QuickProcessed Summaries 190, favoring that over the FullProcessed Summary 760, which only changes in a Batch Data Build.
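The interplay between the two summary tables can be sketched with two dictionaries keyed by entity identifier. This is an illustrative sketch, not the disclosed implementation: the function names are hypothetical, in-memory dictionaries stand in for the data store, and removing a QuickProcessed entry when a new Summary matches the FullProcessed one is an added assumption.

```python
def save_if_changed(new_summary, full_summaries, quick_summaries):
    """Store a re-computed Summary in the quick table only if it differs
    from the Summary produced by the last Batch Data Build."""
    entity_id = new_summary["id"]
    if new_summary != full_summaries.get(entity_id):
        quick_summaries[entity_id] = new_summary
        return True
    # Assumption: drop any stale quick entry once the summaries agree again.
    quick_summaries.pop(entity_id, None)
    return False

def latest_summary(entity_id, full_summaries, quick_summaries):
    """Prefer the QuickProcessed Summary; fall back to the FullProcessed one."""
    if entity_id in quick_summaries:
        return quick_summaries[entity_id]
    return full_summaries.get(entity_id)

# State after a Batch Data Build: quick table starts empty.
full_summaries = {"1": {"id": "1", "name": "Joe's", "cuisine": "Italian"}}
quick_summaries = {}

# A contributor's correction has been processed and the entity re-Summarized.
resummarized = {"id": "1", "name": "Joe's", "cuisine": "Chinese"}
was_saved = save_if_changed(resummarized, full_summaries, quick_summaries)
current = latest_summary("1", full_summaries, quick_summaries)
```

Here the corrected Summary is written to the quick table and returned by the lookup, while the FullProcessed Summary stays unchanged until the next Batch Data Build.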

Batch Processing and Lifecycle

From time to time, a Batch Data Build may be run to convert Unprocessed Data and Inputs into finished view Summary Data. The output of a Batch Data Build is FullProcessed Inputs 360 and FullProcessed Summaries 760.

Input Processing Block

In some embodiments, the Input Processing module 145, 720 can be configured to perform one or more of the extraction process, the cleaning process, the canonicalization process, the filtering process, and the validation process, each of which is described below.

Extraction

The Extraction step may, for example, include a selection of a fact for an attribute based on a matching rule from structured, semi-structured, and unstructured data. For example, the disclosed system could use the fact matching rule “name:[NAME]” to extract a name. In this example, the Extraction step includes selection of the name “Mc'Donalds” in a record like: {“name”:“Mc'Donalds”} using the fact matching rule “name:[NAME].” Additionally, in the Extraction step, the system could use a pattern matching rule like “***-***-****”, where the * symbol represents a wildcard character, to select the phone number “123-456-7890” from text such as: “Tel: 123-456-7890.” As an additional example, the Extraction step could interpret raw text like “This place has no high chairs for my children” to create a fact in the form: {“kid_friendly”:false}. The disclosed system can interpret raw text to create a rule by, for example, using advanced natural language processing and parsing.
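As a minimal sketch, the two rule types in the Extraction examples could be approximated as shown below. The wildcard pattern “***-***-****” is rendered here as a regular expression restricted to digits, which is an assumption; the function names are hypothetical.

```python
import re

# Assumed rendering of the pattern matching rule "***-***-****",
# with each wildcard narrowed to a decimal digit.
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")

def extract_name(record):
    """Fact matching rule "name:[NAME]": select the value stored under "name"."""
    return record.get("name")

def extract_phone(text):
    """Pattern matching rule: select the first phone-shaped substring, if any."""
    match = PHONE_PATTERN.search(text)
    return match.group(0) if match else None

name = extract_name({"name": "Mc'Donalds"})
phone = extract_phone("Tel: 123-456-7890.")
```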

Cleaning

The Cleaning step comprises cleaning extracted data. Cleaning extracted data may include a process to remove undesired or bad characters or entity attributes. For example, extraction of a fact matching the rule “Phone:[PHONE_NUMBER]” might extract incorrect information such as “Phone: call now!” or extract extra information like “Phone:123-456-7890 click here”. Cleaning can discard incorrect data or remove extra data that is not desired. For example, if “Phone: call now!” were extracted, the Cleaning step could discard the data because “Phone: call now!” is incorrect data for a phone number. Additionally, if “Phone:123-456-7890 click here” were extracted, the Cleaning step could discard “click here” because “click here” is extra data that is not part of the extracted phone number. In the disclosed system, incorrect data or extra data can be discarded or removed by, for example, using two rules, such as a fact matching rule and a pattern matching rule. For example, using the fact matching rule “Phone:[PHONE_NUMBER],” the disclosed system could extract information like “Phone:123-456-7890 click here” and, using the pattern matching rule “***-***-****,” the system could determine that “click here” is extra data and remove it during the Cleaning step.
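The two-rule approach to Cleaning could be sketched as follows, applying a fact matching rule first and a pattern matching rule second. The regular expressions and the function name clean_phone are assumptions made for illustration.

```python
import re

# Assumed rendering of the fact matching rule "Phone:[PHONE_NUMBER]".
FACT_RULE = re.compile(r"Phone:\s*(.*)")
# Assumed rendering of the pattern matching rule "***-***-****" (digits only).
PHONE_PATTERN = re.compile(r"\d{3}-\d{3}-\d{4}")

def clean_phone(raw):
    """Return the phone number without extra data, or None for incorrect data."""
    fact = FACT_RULE.match(raw)
    if fact is None:
        return None
    phone = PHONE_PATTERN.search(fact.group(1))
    return phone.group(0) if phone else None

discarded = clean_phone("Phone: call now!")
trimmed = clean_phone("Phone:123-456-7890 click here")
```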

Canonicalization

Canonicalization refers to a rules-driven step to convert data in various formats into their preferred or canonical representation. For example, one contributor may describe a phone number as “123-456-7890” and a different contributor may submit “(123)456-7890”. Converting data into a canonical representation makes it uniform and enables better entity resolution and summarization. The disclosed system can perform canonicalization by, for example, using multiple pattern matching rules and designating one of the patterns as the canonical representation. For example, using the pattern matching rules “***-***-****” and “(***)***-****,” with the former designated the canonical representation, the Canonicalization step could make the inputs “123-456-7890” and “(123)456-7890” uniform by representing them both as “123-456-7890.”
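A minimal sketch of this Canonicalization example follows. The accepted patterns are assumed regular-expression renderings of “***-***-****” and “(***)***-****”, and the function name canonicalize_phone is hypothetical.

```python
import re

# Accepted input patterns; each captures area code, prefix, and line number.
ACCEPTED_PATTERNS = [
    re.compile(r"^(\d{3})-(\d{3})-(\d{4})$"),       # canonical form
    re.compile(r"^\((\d{3})\)\s*(\d{3})-(\d{4})$"),  # alternate form
]

def canonicalize_phone(value):
    """Rewrite any accepted form as "***-***-****"; leave other values untouched."""
    for pattern in ACCEPTED_PATTERNS:
        match = pattern.match(value.strip())
        if match:
            return "-".join(match.groups())
    return value

first = canonicalize_phone("123-456-7890")
second = canonicalize_phone("(123)456-7890")
```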

Filtering

Filtering refers to a rules-driven step to reject data that is not necessarily incorrect, but does not meet some desired criteria. This can include rejecting inputs that don't match a particular category or have insufficient confidence. For example, a Science Fiction theme restaurant might advertise that it is “located on the planet Earth in the Milky Way galaxy.” While this statement is accurate, an embodiment of the disclosed system might, for example, not have a category for the planet and galaxy where restaurants are located, and as such, the Filtering step in this example would reject the input “located on the planet Earth in the Milky Way galaxy.” Of course, in alternate embodiments, the disclosed system could have such categories. As an additional example, in an embodiment, the disclosed system could, for example, set a threshold of 100 visits for information from a website to be considered reliable. In this example, if a website that had been visited only 15 times contained the statement “it is the best store,” the system could reject the input because it does not meet the confidence rule. In other embodiments, the disclosed system could use other rules for determining confidence.

Validation

Validation refers to a rules-driven step to reject data based on non-conformance with certain criteria. For example, where a phone number field, after canonicalization, has fewer digits than are expected for a valid phone number (e.g., Phone: 123), it is possible to reject the attribute or the entire input based on failure to meet certain criteria.
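The confidence rule from the Filtering example and the digit-count rule from the Validation example could be sketched as two small predicates. The 100-visit threshold comes from the example above; the expected count of ten digits and both function names are assumptions.

```python
import re

MIN_VISITS = 100            # threshold used in the Filtering example
EXPECTED_PHONE_DIGITS = 10  # assumption: ten-digit phone numbers

def passes_confidence_filter(source_visits, min_visits=MIN_VISITS):
    """Filtering: keep a fact only if its source has enough visits."""
    return source_visits >= min_visits

def is_valid_phone(value):
    """Validation: a canonicalized phone number must have the expected digits."""
    digits = re.sub(r"\D", "", value)
    return len(digits) == EXPECTED_PHONE_DIGITS
```

In this sketch, a statement from a website visited only 15 times would be filtered out, and “Phone: 123” would fail validation.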

Real-Time Summarization

Embodiments of the disclosed system may perform a Real-time Summarization process. Referring to FIGS. 1 and 2, in embodiments of the disclosed system, the Quick Summarization process module 160 receives QuickProcessed inputs 150, generated by the Input Processing module 145, and FullProcessed inputs 360, generated by the batch processing system.

The Quick Summarization process module 160 can be configured to aggregate and filter the QuickProcessed inputs 150 and FullProcessed inputs 360. For example, the Quick Summarization process 160 could receive QuickProcessed inputs 150 and FullProcessed inputs 360 regarding valet parking. In this example, FullProcessed inputs 360 might include inputs with the value “valet_parking”:false and QuickProcessed inputs 150 might include inputs with the value “valet_parking”:true. The Quick Summarization process module 160 can be configured to aggregate and filter the QuickProcessed inputs 150 and the FullProcessed inputs 360 to create a QuickProcessed Summary 190. For example, after filtering and processing, the QuickProcessed Summary 190 for an entity might be “valet_parking”:true.

In some embodiments, the Quick Summarization process module 160 can be configured to maintain and index data in the QuickProcessed Inputs 150 and the FullProcessed Inputs 360 in a sort order determined based, at least in part, on one or more of the entity identifier, the identifier of the contributor or a user account that provided the data, the technology used to extract the data, the source or citation for the data, and/or a timestamp. To this end, a connection or iterator is created to read data simultaneously, starting from the first Input with the desired entity identifier, from both the QuickProcessed Inputs and the FullProcessed Inputs. In each case, the iterator is advanced on either QuickProcessed Inputs or FullProcessed Inputs, whichever has the earlier timestamp. Whenever any of the attributes enumerated above, except for the timestamp, changes, the previous input is added to the pool being considered while the others are ignored, thus allowing the system to efficiently consider only the latest version of an input from a given user, extraction technology, and citation.
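One way to sketch this concurrent read is a merge of two pre-sorted streams followed by grouping on every sort attribute except the timestamp, keeping only the last (newest) input in each group. The field names and the function name latest_inputs are hypothetical, and the standard-library merge used here stands in for the paired iterators described above.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Grouping attributes, and the full sort order (grouping attributes + timestamp).
GROUP_KEY = itemgetter("entity_id", "contributor", "technology", "citation")
SORT_KEY = itemgetter("entity_id", "contributor", "technology", "citation", "timestamp")

def latest_inputs(quick_inputs, full_inputs, entity_id):
    """Merge two streams already sorted by SORT_KEY and keep, for the given
    entity, only the newest input per (contributor, technology, citation)."""
    merged = heapq.merge(quick_inputs, full_inputs, key=SORT_KEY)
    pool = []
    for key, versions in groupby(merged, key=GROUP_KEY):
        if key[0] != entity_id:
            continue
        *_, newest = versions  # the last item in a group has the latest timestamp
        pool.append(newest)
    return pool

full = [
    {"entity_id": "e1", "contributor": "u1", "technology": "form",
     "citation": "c1", "timestamp": 1, "valet_parking": False},
]
quick = [
    {"entity_id": "e1", "contributor": "u1", "technology": "form",
     "citation": "c1", "timestamp": 2, "valet_parking": True},
    {"entity_id": "e1", "contributor": "u2", "technology": "form",
     "citation": "c1", "timestamp": 1, "valet_parking": True},
]
pool = latest_inputs(quick, full, "e1")
```

Here the older FullProcessed input from contributor u1 is superseded by that contributor's newer QuickProcessed input, so only two inputs remain in the pool.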

The Diff process module 200 can be configured to compare QuickProcessed Summaries 190, generated by the Quick Summarization process module 160, and the FullProcessed Summaries 760. For example, it might compare a QuickProcessed Summary 190 with the value “valet_parking”:true and a FullProcessed Summary 760 with a value “valet_parking”:false. Based on the comparison, the Diff process module 200 could then broadcast the result. For example, it could broadcast that the FullProcessed Summary 760 from the previous batch build indicates that there is no valet parking, whereas the QuickProcessed Summary 190 from the newly computed Inputs and Summaries since the start of the previous Batch Data Build indicates that there is valet parking.

Real-Time Processing Workflow

Referring to FIGS. 1 and 2, as indicated by the arrow Real-time 10, the upper portion of the diagram generally depicts real-time components of the system. The system receives External Contributions 100 as inputs. External Contributions 100 include bulk data contributions, web documents, and real-time submissions from contributors. For example, the system can receive bulk data contributions such as entire websites or data stores, web documents such as individual web pages, and real-time submissions such as reviews on a website.

One source of External Contributions 100 is User Writes 110. User Writes 110 could include, for example, direct input from contributors on a web form or a mobile device.

In some embodiments, the system can receive User Writes 110 via Public API module 130. For example, User Writes 110 could be received through a publicly accessible endpoint such as a website that submits to the Public API module 130 or through software on a website or mobile device that submits to the Public API module 130. User Writes 110 may include identifiers for the contributor, origin, and developer added to them for consideration in summarization.

An input such as a User Write 110 may have an entity identifier (e.g., entity_id) already included with it. An entity identifier can be a string of letters and numbers. In some embodiments, an entity identifier can signify that the input is an update to an existing entity. If the input does not have an identifier, the system can determine and assign a temporary identifier, referred to as a QuickProcessed Identifier, using the Resolve process module 120. The Resolve process module 120 can be configured to assign an entity identifier to an input or to match one representation of a record to another. This makes it possible to cluster similar inputs together and assign those that reference the same entity with a common entity identifier. In many cases, inputs have different attributes but reference the same entity. The Resolve process can be used to compare inputs, determine that the inputs reference the same entity, and assign a common entity identifier to those inputs.

In some cases, the Resolve process module 120 can be configured to assign an identifier as a surrogate key generated by a) random assignment, b) concatenating one or more input attributes (e.g. name+address), c) consistent hashing of one or more input attributes (e.g. md5(name+address)), or d) taking the assigned id of an existing input if a sufficiently similar input exists (e.g. name, value, phone of new input is similar enough to name, value, phone of existing input) and generating a new surrogate key when it is not.
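The four surrogate key options could be sketched as follows. The similarity measure (a character-level ratio from the Python standard library), the 0.9 threshold, and all function names are assumptions chosen for illustration; the disclosed system may judge similarity differently.

```python
import hashlib
import uuid
from difflib import SequenceMatcher

def random_key():
    """a) random assignment."""
    return str(uuid.uuid4())

def concatenated_key(record, fields=("name", "address")):
    """b) concatenation of one or more input attributes (name+address)."""
    return "+".join(record[f] for f in fields)

def hashed_key(record, fields=("name", "address")):
    """c) consistent hashing, e.g. md5(name+address)."""
    return hashlib.md5(concatenated_key(record, fields).encode("utf-8")).hexdigest()

def resolve(record, existing, threshold=0.9):
    """d) reuse the id of a sufficiently similar existing input,
    otherwise generate a new surrogate key."""
    candidate = concatenated_key(record).lower()
    for entity_id, known in existing.items():
        score = SequenceMatcher(None, candidate, concatenated_key(known).lower()).ratio()
        if score >= threshold:
            return entity_id
    return random_key()

existing = {"id-1": {"name": "Joe's", "address": "1 Main St"}}
matched = resolve({"name": "Joe's", "address": "1 Main St."}, existing)
fresh = resolve({"name": "Zed Cafe", "address": "99 Oak Ave"}, existing)
```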

Once a QuickProcessed Identifier is determined for the input, Internal API module 140 can receive the input from Public API module 130. Before it is saved to storage, a copy of the original input, in its raw form, can be made. The raw copy can be saved to storage for Unprocessed Inputs 350 so that it can later be reprocessed in batch with updated software or subjected to more expensive computation, including software that does entity identifier assignment.

Additionally, Internal API module 140 can be configured to interact with the Stitch process module 155 for rules-driven moderation. The Stitch process module 155 for rules-driven moderation can be configured to display or provide data submissions that match certain criteria to a human moderator or more expensive machine processes for further evaluation. For example, a new restaurant owner might wish to drive business to his restaurant by diverting it from nearby restaurants. That restaurant owner might sign up for an account as a contributor of one of the disclosed system's customers and submit information that all of the other restaurants are closed. The system could then determine that a new contributor who has never interacted with the system has, on one day, reported that several local businesses have closed, causing a rule in the system that looks for such patterns to flag those submissions and enqueue them for review by a human moderator. The human moderator could in turn determine that the businesses are indeed still open, reject the submissions, and further blacklist the contributor such that additional submissions will be ignored.

At the same time, the original raw input can be processed through software that performs Extraction, Cleaning, Canonicalization, and Validation, as described above, producing a QuickProcessed Input 150. In some cases, the QuickProcessed Input 150 may have a QuickProcessed Identifier attached. If the QuickProcessed Input passes Validation, it can be saved to storage for QuickProcessed Inputs 150 and it can move forward in the process to QuickSummarization process module 160.

Real-time QuickSummarization process module 160 can be configured to analyze and combine the QuickProcessed Inputs 150 and FullProcessed Inputs 360 for an entity in substantially real-time. QuickProcessed Inputs 150 can represent new real-time inputs since the last Batch Data Build process. FullProcessed Inputs 360 are inputs generated from a previous Batch Process by the batch processing system. Together, they comprise the full set of Inputs for each entity. For example, the Real-time QuickSummarization process module 160 could receive QuickProcessed inputs 150 and FullProcessed inputs 360 regarding valet parking. In this example, FullProcessed inputs 360 might include inputs with the value “valet_parking”:false and QuickProcessed inputs 150 might include inputs with the value “valet_parking”:true. The QuickSummarization process module 160 could then aggregate the QuickProcessed inputs 150 and FullProcessed inputs 360 and then filter them using High Confidence Filter 170 and Low Confidence Filter 180 to create a QuickProcessed Summary 190. For example, after filtering and processing, the QuickProcessed Summary 190 for an entity might be “valet_parking”:true.

QuickProcessed Inputs and FullProcessed Inputs may be stored in a data store 150, 360. The data store 150, 360 can be respectively clustered by an entity identifier (e.g. a uuid such as 0e3a7515-44e0-42b6-b736-657b126313b5). This can allow re-Summarization to take place quickly as new Inputs are received such that Inputs only pertaining to an entity for which a new Input has been received are processed. QuickProcessed Inputs 150 and FullProcessed Inputs 360 may be stored in a sort order such that it facilitates processing the data in streams or skipping over Inputs that are determined to be superseded by Inputs that are, for example, newer submissions from the same submitter or citing an identical reference.

In some embodiments, the QuickSummarization process module 160 can read Inputs concurrently from the QuickProcessed Input data store 150 and FullProcessed Input data store 360, to facilitate choosing only the Inputs that need to be considered in Summarization. The summarization of the Inputs can be represented or displayed using a view (e.g., a materialized view). A view is one possible summarization of the Inputs and representation of entities according to one or more rules. The one or more rules can determine which entities are included, which attributes are included for each entity, what indexing optimizations are performed, and what additional attributes and attribute variations are computed for each entity. In some embodiments, a view is uniquely identified by a view_id. Data stores often track views in system tables and these system tables contain metadata about views. In our case, a view is assigned an identifier and that identifier is used to look up metadata about the view from the data stores, such as the names of the attributes, their datatype, sort preferences, indexing rules, etc.

For each view associated with the dataset to which the Input was assigned, a QuickSummarization Process can be performed. Views may have different rules about attributes to be computed, the rules that apply to those attributes, confidence thresholds 170, 180 for the Summary entity, and other software rules and transformations. Each QuickSummarization process can produce a QuickProcessed Summary for each View. Each QuickProcessed Summary is compared to the most recent Summary retrieved from the QuickProcessed Summary data store 190 or the FullProcessed Summary data store 760. If the QuickProcessed Summary is different than the previous version, the new QuickProcessed Summary is saved to the QuickProcessed Summary data store 190 and a Diff record is produced. A Diff record can include a row of data that includes, for example, (1) an entity identifier of the entity whose attributes have changed and (2) the changed attributes. The Diff record may include an entire copy of the new Summary or the attributes that are different from the previous Summary. The Diff record is saved to a Diff data store and published over the network to processes that listen for Diff records and update Materializations of Summary data.

The following is an example of a Diff record in one possible embodiment: {“timestamp”:1363321439041, “payload”: {“region”:“TX”, “geocode_level”:“front_door”, “tel”:“(281) 431-7441”, “placerank”:90, “category_labels”:[[“Retail”,“Nurseries and Garden Centers”]], “searchtags”:[“Houston”,“Grass”,“South”], “name”:“Houston Grass”, “longitude”:“-95.464476”, “fax”:“(281) 431-8178”, “website”:“http://houstonturfgrass.com”, “postcode”:“77583”, “country”:“us”, “category_ids”:[164], “category”:“Shopping > Nurseries & Garden Centers”, “address”:“213 McKeever Rd”, “locality”:“Rosharon”, “latitude”:“29.507771”}, “type”:“update”, “factual_id”:“399895e6-0879-4ed8-ba25-98fc3e0c983f”, “changed”: [“address”,“tel”]}. In this example, the Diff record indicates that the address and telephone for Houston Grass have changed, which can result in an update to each copy of the materialized data store or index rows for that entity.
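A Diff record shaped like the example above could be produced by comparing the previous Summary with the new one, as in the following sketch. The function name make_diff and the choice to carry the full new Summary as the payload are assumptions; as noted above, an embodiment may instead carry only the changed attributes.

```python
import time

def make_diff(factual_id, previous, current, now_ms=None):
    """Return an "update" Diff record, or None when nothing changed."""
    changed = sorted(
        key for key in previous.keys() | current.keys()
        if previous.get(key) != current.get(key)
    )
    if not changed:
        return None
    return {
        "timestamp": now_ms if now_ms is not None else int(time.time() * 1000),
        "payload": dict(current),  # entire copy of the new Summary
        "type": "update",
        "factual_id": factual_id,
        "changed": changed,
    }

# Hypothetical previous values; the new values follow the example record.
old = {"name": "Houston Grass", "address": "1 Old Rd", "tel": "(281) 000-0000"}
new = {"name": "Houston Grass", "address": "213 McKeever Rd", "tel": "(281) 431-7441"}
diff = make_diff("399895e6-0879-4ed8-ba25-98fc3e0c983f", old, new, now_ms=1363321439041)
```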

Embodiments of the disclosed system may include materialized data stores or indexes 510, 520. Materialized data stores or indexes 510, 520 are searchable relational or non-relational data stores or search index servers. In some embodiments, the materialized data stores or indexes 510, 520 can be associated with a particular application domain or a particular data service. The disclosed system may utilize data store systems like PostgreSQL (relational) and Apache Solr (non-relational, search server) interchangeably, sometimes for the same data, and can choose the one that best services the type of query requested. For example, the disclosed system can receive a query for data associated with a particular view or type of data. In response, the disclosed system can determine one or more of a type of the query, a type of the entity, an application or a device that sends the query, an application domain associated with the entity, or any relevant information associated with the query to determine one of the data store systems or a combination of such systems to use to respond to the query. Then the disclosed system can use the determined one of the data store systems or combinations of systems to respond to the query.

Batch Processing Workflow

Referring to FIGS. 1 and 2, as indicated by the arrow Batch 20, the lower portion of the diagram generally illustrates batch processing components of the system.

The Batch Processing Workflow can receive Large Uploads and Bulk Contributions 700 such as:

-   -   Raw Inputs 700
    -   Universally Unique Identifier (uuid) attachment data
    -   Message Digest 5 (md5) attachment data

In addition, the Batch Processing Workflow uses previously processed data such as previous FullProcessedInputs and previous QuickProcessedInputs for steps such as UUID Retention and Diff Generation. These steps are described below.

Pre Batch Build

Prior to initiating the Batch Build process, the real-time processed data can be provided to a data store 710, such as a Hadoop Distributed File System (HDFS), so that it can be used as input for the Batch Build. When this step is initiated, the time can be recorded so that it can be used during the Catchup Phase.

Specifically, the following data can be provided to the data store 710:

-   -   QuickProcessedSummaries 190: summaries that have been created since the last Batch Data Build. This may include brand new summaries, deleted summaries, and summaries that have certain fields updated.
    -   Unprocessed Inputs 350: the raw inputs that have been written to this dataset since the last Batch Data Build.
    -   new uuid mappings: a mapping of input ids to entity ids for new summaries that were generated since the last batch run.

In some embodiments, quick processed inputs from the previous version of the data are not used, except for UUID Retention (described below). In such embodiments, this is accounted for by using the Unprocessed Inputs 350 instead. This ensures that inputs are completely reprocessed.

Batch Build

The Batch Build is a process in which the data can be processed and made ready for loading into production.

Input Processing:

Raw inputs 700 and Unprocessed Inputs 350 are fed into the Input Processing module 720 from HDFS 710. The Extract step may not preserve any notion that the data was previously extracted. Extraction may be done on the raw inputs 700.

The Extract step may use a rule framework that canonicalizes, cleans, fills in values, and filters inputs as described above and as illustrated in the examples below.

123 main street=>123 Main St.

city: Los Angeles=>city: Los Angeles, state: CA

The Extract step can also sort out which inputs should be attached and which inputs should be batch-resolved. The Extract step can optionally determine that some inputs should be reviewed by a human, a computationally powerful process, or a third party API. The Extract step can set a moderation action flag within the metadata of the input and insert it directly or via API into the Stitch data store, which is used to coordinate relatively costly processes such as moderation.

Batch Resolve:

Batch resolve, which can be performed by the Resolve process module 722, may take the extracted inputs, group them based on whether they represent the same entity or not, and assign a unique id to each set of inputs. For example, batch resolve can assign a unique id generated by a) random assignment, b) concatenating one or more input values from a set of inputs (e.g. name+address), c) hashing of one or more values from a set of inputs (e.g. md5(name+address)), or d) taking the assigned id of an existing set of inputs if a sufficiently similar input exists (e.g. name, value, phone of new input is similar enough to name, value, phone of existing input).

UUID Retention:

After the batch resolve module 722 completes its process, the UUID Retention module 725 can be initiated. An objective of the UUID Retention module 725 can include modifying the identifier associated with entities (e.g., entity_id) so that a single entity, such as the Eiffel Tower, can maintain the same entity identifier even when input data is re-processed by the batch processes (e.g., across multiple batch runs). This enables an entity to be associated with the same identifier even when data associated with the entity is re-processed multiple times.

This is accomplished, for example, by reading in the previous FullProcessedInputs 360, and generating a mapping file or table, which includes a mapping between an input_id and an entity_id. An input_id is a unique identifier assigned to each set of attributes coming from a single input data contribution. For example, all of the attributes pulled from the homepage of the French Laundry, such as the name, address, and phone number, constitute one input data contribution. Many other websites and contributors can also provide input data contributions describing the French Laundry. Each of these input data contributions has its own input_id that uniquely distinguishes it from other input data contributions. The input_id can include a message digest 5 (md5) hash of an input data contribution. In contrast, an entity_id is an identifier assigned to all of the input data contributions and the summary record of the French Laundry (currently a UUID).

In some embodiments, the input_id to entity_id mapping can be combined with the mappings for newly written summaries. For example, each input in FullProcessed Inputs and QuickProcessed Inputs has an input_id that uniquely identifies the original unprocessed input and an entity_id representing the entity associated with the FullProcessed Inputs and QuickProcessed Inputs, as determined by the Resolve process module 120. Using the input_id to entity_id mapping, the input_id can be used to assign the original entity_id to all input data items that are within the same set.

In some embodiments, an example of UUID Retention could be as follows:

Mapping:

-   -   input_id_0, original_entity_id

Input Set:

-   -   input_id_0, new_entity_id, data
    -   input_id_1, new_entity_id, data
    -   input_id_2, new_entity_id, data

In this example, in the previous batch build, input_id_0 had the entity id “original_entity_id”. In the current Batch Build, since the sample input set contains the input with input id input_id_0, the disclosed system can map all inputs in the input set to original_entity_id. As such, in this example, the end result would be as follows:

End Result:

-   -   input_id_0, original_entity_id, data
    -   input_id_1, original_entity_id, data
    -   input_id_2, original_entity_id, data

The output of the UUID Retention module 725 can include a grouped set of inputs that have the same entity_ids as they had in the previous batch run, as well as preserving any entity_ids that were generated in between batch runs. As described above, the UUID Retention module 725 can preserve the same UUID for the same entity across Batch Builds.
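The UUID Retention example above could be sketched as follows. Inputs are modeled as (input_id, entity_id, data) tuples and the mapping as a dictionary; the function name retain_entity_id is hypothetical, and the sketch does not address the merge and split cases.

```python
def retain_entity_id(input_set, previous_mapping):
    """Reassign a previously known entity_id to every input in the set.

    input_set: list of (input_id, entity_id, data) tuples grouped by Batch Resolve.
    previous_mapping: input_id -> entity_id from the previous batch build.
    """
    retained = next(
        (previous_mapping[input_id] for input_id, _, _ in input_set
         if input_id in previous_mapping),
        None,
    )
    if retained is None:
        return list(input_set)  # no earlier id known; keep the new one
    return [(input_id, retained, data) for input_id, _, data in input_set]

mapping = {"input_id_0": "original_entity_id"}
input_set = [
    ("input_id_0", "new_entity_id", "data"),
    ("input_id_1", "new_entity_id", "data"),
    ("input_id_2", "new_entity_id", "data"),
]
end_result = retain_entity_id(input_set, mapping)
```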

In some cases, entities may be merged or split, depending on the result of Batch Resolve. The UUID Retention module 725 can, in effect, specify how to deal with split and merge cases. For example, in the case of a merge, it may be preferred to use the entity_id with the greater number of inputs. In the case of a split, the entity_id may be assigned to the input set that has the greater number of inputs, and a new id may be generated for the input cluster forming the new summary. This behavior can be customized depending on the dataset and desired outcome.

Data Attachment

After UUID Retention, the Data Attachment process may be performed by the Data Attachment module 727. A purpose of the Data Attachment process can be to attach inputs that are (1) unresolvable, (2) derived from a summary, (3) derived from an input, or (4) determined, with a sufficient degree of confidence, to pertain to a specific entity_id, such as contributor edits to a specific entity_id or an input that has geocode information pertaining to a specific input.

Data attachment can be based on an entity_id or an input_id. For example, the Data Attachment module 727 can be configured to attach (or combine) a source input to a set of inputs generated from UUID Retention when the source input has the same entity_id as that of the set of inputs. As another example, the Data Attachment module 727 can be configured to attach (or combine) a source input when the source input is associated with the same parent input_id as that of the set of inputs, where the parent input id refers to a unique identifier of the input to which the source input should be attached. These examples are illustrated with the following embodiments.

In some embodiments, an example of Entity_ID Data Attachment could be as follows:

Attachment Data:

-   -   input_id_0, entity_id_0, data

Input Set:

-   -   input_id_1, entity_id_0, data
    -   input_id_2, entity_id_0, data
    -   input_id_3, entity_id_0, data

In this example, since the sample input set and the attachment data have the same entity_id, entity_id_0, the attachment data is added to the sample input set.

End Result:

-   -   input_id_0, entity_id_0, data
    -   input_id_1, entity_id_0, data
    -   input_id_2, entity_id_0, data
    -   input_id_3, entity_id_0, data

In some embodiments, an example of Input_ID Data Attachment could be as follows:

Attachment Data:

-   -   input_id_0, (no entity id), parent_input_id: input_id_1, data

Input Set:

-   -   input_id_1, entity_id_0, data
    -   input_id_2, entity_id_0, data
    -   input_id_3, entity_id_0, data

In this example, since the sample input set contains an input that matches the parent input id of the attachment data, the attachment data is added to the sample input set.

End Result:

-   -   input_id_0, entity_id_0, data
    -   input_id_1, entity_id_0, data
    -   input_id_2, entity_id_0, data
    -   input_id_3, entity_id_0, data
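Both attachment examples above could be sketched with a single function that attaches a source input when either its entity_id matches the input set or its parent input id is found in the input set. The dictionary field names and the function name attach are assumptions made for this sketch.

```python
def attach(input_set, attachment):
    """Attach a source input to an input set by entity_id or parent_input_id.

    input_set: list of dicts that share one entity_id.
    attachment: dict with "input_id", "data", and either "entity_id"
    or "parent_input_id".
    """
    entity_id = input_set[0]["entity_id"]
    input_ids = {row["input_id"] for row in input_set}
    same_entity = attachment.get("entity_id") == entity_id
    known_parent = attachment.get("parent_input_id") in input_ids
    if not (same_entity or known_parent):
        return list(input_set)
    attached = {
        "input_id": attachment["input_id"],
        "entity_id": entity_id,  # the attached input takes the set's entity_id
        "data": attachment["data"],
    }
    return [attached] + list(input_set)

input_set = [
    {"input_id": f"input_id_{i}", "entity_id": "entity_id_0", "data": "data"}
    for i in (1, 2, 3)
]
by_entity = attach(
    input_set, {"input_id": "input_id_0", "entity_id": "entity_id_0", "data": "data"}
)
by_parent = attach(
    input_set, {"input_id": "input_id_0", "parent_input_id": "input_id_1", "data": "data"}
)
```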

Extended Attribute Set Extraction

Extended Attribute Set Extraction is an additional extraction process performed by the extended attributes module 728. The extended attributes module 728 can be configured to run extraction on certain inputs to extract an “extended attribute set”. An extended attribute set may not be a part of the core attribute set, but may contain information that pertains to specific views. For example, “vegan” is an attribute that would pertain to a restaurant view but not to a doctor's view.

In some embodiments, rules may be written in a rules framework that determines whether a set of inputs is re-extracted for extended attributes. For example, if a set of inputs has a single input that has the category “Restaurant”, all inputs in that input set can be re-extracted for extended attributes pertaining to restaurants.

The output of the extended attributes module 728 includes final inputs 729. The final inputs can be stored in the FullProcessed Inputs 360 storage, which may relay the final inputs to the quick summarization module 160.

Summarization

Summarization module 730 is configured to perform a summarization process. The summarization process includes a process by which the final representation of a set of inputs representing the same entity can be generated. Summarization module 730 can use a rules framework to generate a summary based on the final inputs 729. Each dataset may have multiple views, including side-effect views. Each set of inputs may generate multiple view summaries. Each of the summaries generated from the same set of inputs has the same entity_id.

A side-effect view includes a new view (e.g., set of summary entities) that does not have a one-to-one relationship with the entity id for the given inputs. A side-effect view can be generated as a by-product of other views and their inputs rather than directly producing summaries from the associated entity inputs. The side-effect view allows the summarization module 730 to provide an arbitrary number of summary records (e.g., an arbitrary number of related entities) from a single data input. One such example of this is Crosswalk, which is a view that links entity_ids to specific input sources. For instance, the side-effect view creation process can determine whether an input data matches a rule, such as “is a namespace we track in Crosswalk” (e.g. because it has a url like webname.com/[some_place_id]), and once there is a match, the side-effect view creation process can create a new entity, for example, with {“namespace”:“webname”, “id”:“[some_place_id]”, “factual_id”:“[id_of_referenced_entity]”}. Therefore, even if the input data is already associated with an entity, the side-effect view creation process can generate additional entities associated with the input data based on a rule maintained by the side-effect view creation process.
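The Crosswalk rule described above could be sketched as follows. The table of tracked namespaces, the input field names, and the example place id 12345 are assumptions; only the shape of the created entity follows the example in the text.

```python
from urllib.parse import urlparse

# Hypothetical table of tracked namespaces, keyed by URL host.
TRACKED_NAMESPACES = {"webname.com": "webname"}

def crosswalk_entities(inputs):
    """Create one Crosswalk entity for each input whose url is in a tracked namespace."""
    created = []
    for item in inputs:
        parsed = urlparse(item.get("url", ""))
        namespace = TRACKED_NAMESPACES.get(parsed.netloc)
        place_id = parsed.path.strip("/")
        if namespace and place_id:
            created.append({
                "namespace": namespace,
                "id": place_id,
                "factual_id": item["entity_id"],  # id of the referenced entity
            })
    return created

inputs = [
    {"entity_id": "entity_id_0", "url": "http://webname.com/12345"},
    {"entity_id": "entity_id_0", "url": "http://untracked.example/678"},
]
crosswalk = crosswalk_entities(inputs)
```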

Following Summarization 730, the results may be filtered with High Confidence Filter 740 and Low Confidence Filter 750 as described previously, and stored as FullProcessed Summaries 760.

Data store Format Generation process

In this process, the FullProcessedInputs 360 and FullProcessedSummaries 760 are built. FullProcessedInputs 360 can include all inputs for a given dataset and can be organized in a way where entity_id lookup and summarization is efficient. FullProcessedSummaries 760 can contain all the summary records for all views in a given dataset, organized in a way where entity_id and view_id lookup is efficient. These files can be bulk loaded into a data store during a MakeLive step. The output of this step is represented by 729, 740, and 750 in FIG. 1.

Diff Generation

Diff Generation module 770 can be configured to generate all the “diff” records that comprise the difference between the current batch run and the prior real-time updated dataset and output them to Diff API to Download Partners 500, which allows authorized partners to download the difference records from the system. Each such record can be referred to as a “diff.” Specific diff types are described above. Diffs can be generated by comparing each summary for a view against the prior version of the summary for that same view. Diffs can be generated for every view for each summary. The current summaries can be compared against the prior FullProcessedSummaries 760 and prior QuickProcessedSummaries 190 tables. The same diff generation mechanism can be used to generate the diffs for the indexes 510, 520, and the diff for third parties to be provided via the diff API 500.

The Diffs are also written to the Data store Format, which allows for efficient lookup based on date and entity_id.
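The comparison of each current summary against its prior version can be sketched as follows. This is an illustration only: summaries are assumed to be dictionaries keyed by (entity_id, view_id), and the diff type labels "insert", "update", and "delete" are hypothetical stand-ins for the diff types described elsewhere in this disclosure.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, str]  # (entity_id, view_id); an assumed keying scheme


def generate_diffs(prior: Dict[Key, dict], current: Dict[Key, dict]) -> List[dict]:
    """Compare every summary in every view against its prior version."""
    diffs = []
    for key in sorted(set(prior) | set(current)):
        old, new = prior.get(key), current.get(key)
        if old == new:
            continue  # unchanged summaries produce no diff
        if old is None:
            kind = "insert"   # hypothetical diff type label
        elif new is None:
            kind = "delete"   # hypothetical diff type label
        else:
            kind = "update"   # hypothetical diff type label
        entity_id, view_id = key
        diffs.append({
            "type": kind,
            "entity_id": entity_id,
            "view_id": view_id,
            "old": old,
            "new": new,
        })
    return diffs
```

The same function could serve both the index updates and the partner-facing diff output, since both consume the same list of difference records.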

Materialization Build

Materialization Build module 780 is configured to produce an output format that is ready for serving other computing systems, such as data stores. For example, the Materialization Build module 780 can be configured to build an inverted index (e.g., a data store) that allows for searching of the inputs. In some embodiments, the Materialization Build module 780 can be configured to build a materialization on a per-view basis. In other embodiments, the Materialization Build module 780 can be configured to build a materialization that includes multiple views.

In some embodiments, a simplified example of an inverted index materialization can include the following:

Sample Data:

-   -   doc_id_0, entity_id_0, view_id, Business, San Diego, CA
    -   doc_id_1, entity_id_1, view_id, Business, San Francisco, CA

Index:

-   -   entity_id_0: {doc_id_0}
    -   entity_id_1: {doc_id_1}
    -   Business: {doc_id_0, doc_id_1}
    -   San: {doc_id_0, doc_id_1}
    -   Diego: {doc_id_0}
    -   Francisco: {doc_id_1}
    -   CA: {doc_id_0, doc_id_1}

Using the simplified index in the example, the data can be easily searchable by keyword or other attributes. For example, searching for “Diego” would yield the summary for doc_id_0, whereas searching for “San” would yield summaries for doc_id_0 and doc_id_1 in this example.

If there are multiple views per materialization, the view_id could be used as an additional keyword filter for searches.
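The simplified inverted index above can be reproduced with a short Python sketch. The function names `build_inverted_index` and `search`, the whitespace tokenization, and the row layout are assumptions made for illustration; a production materialization would typically rely on a dedicated search index.

```python
from collections import defaultdict


def build_inverted_index(rows):
    """Map each entity_id and each whitespace-separated token to doc ids."""
    index = defaultdict(set)
    for doc_id, entity_id, view_id, *fields in rows:
        index[entity_id].add(doc_id)
        # view_id is left out of the index here, matching the simplified
        # example; it could be added as one more keyword when a single
        # materialization holds multiple views.
        for field in fields:
            for token in field.split():
                index[token].add(doc_id)
    return index


def search(index, *keywords):
    """Return the doc ids that contain every given keyword."""
    postings = [index.get(keyword, set()) for keyword in keywords]
    return set.intersection(*postings) if postings else set()


rows = [
    ("doc_id_0", "entity_id_0", "view_id", "Business", "San Diego", "CA"),
    ("doc_id_1", "entity_id_1", "view_id", "Business", "San Francisco", "CA"),
]
index = build_inverted_index(rows)
```

With this index, `search(index, "Diego")` returns only doc_id_0, while `search(index, "San")` returns both documents, and combining keywords such as `search(index, "San", "Francisco")` narrows the result to doc_id_1.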

In some embodiments, each materialized data store can be associated with a particular application domain, a particular service, or a particular view. Therefore, when a system receives a query for data, the system can determine, based on the particular application domain, the particular service, and/or the particular view associated with the query and/or requested data, one or more of the materialized data stores to serve the query.

Batch Processing MakeLive

MakeLive is a process by which a Batch Build can be put into production. The MakeLive process can be accomplished through Data Store Loading, Catchup, and New Materialization Notification. After the MakeLive process is completed, all API requests can use the newly batch-built data.

Data Store Loading

Once a Batch Data Build passes all required regression and other Quality Assurance tests, a new table in the data store can be created with a new version number for FullProcessedInputs 360, FullProcessedSummaries 760, QuickProcessedInputs 150, and QuickProcessedSummaries 190. The data store format files (FullProcessedInputs 360, FullProcessedSummaries 760) can be loaded into their respective new tables. Diffs 200/770 can be appended to an existing DiffTable.

In FIG. 1, the Real-Time Processing can refer to newly built FullProcessedInputs 360 and FullProcessedSummaries 760 through a data store-api-server once data store loading is complete. This can be accomplished by changing the pointer of the FullProcessedInputs 360 and FullProcessedSummaries 760 tables so that the newer tables are visible to Real-Time Processing and the older references are no longer visible to Real-Time Processing.

An example of the loading of FullProcessedInputs 360 is illustrated in the transition from 729 to 360 in FIG. 1. An example of the loading of FullProcessedSummaries 760 is illustrated in the transition from 740, 750 to 760 in FIG. 1.
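The versioned loading and pointer change described in this section can be pictured with a small in-memory sketch. The class `VersionedTables` and its methods are hypothetical; an actual data store would expose its own mechanism for versioned tables and for repointing readers.

```python
class VersionedTables:
    """Toy registry of versioned tables with a 'live' pointer per table name."""

    def __init__(self):
        self._tables = {}  # (name, version) -> table contents
        self._live = {}    # name -> version currently visible to readers

    def load(self, name, version, rows):
        """Bulk load a new version without affecting what readers see."""
        self._tables[(name, version)] = rows

    def make_live(self, name, version):
        """Repoint readers to an already loaded version."""
        if (name, version) not in self._tables:
            raise KeyError(f"{name} version {version} has not been loaded")
        self._live[name] = version

    def read(self, name):
        """Return the contents of the version the pointer refers to."""
        return self._tables[(name, self._live[name])]
```

In this sketch, loading a new version leaves reads unchanged until `make_live` is called, so the switch from the older table to the newer one is a single pointer update rather than an in-place rewrite.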

Catchup Phase

During the time between when the batch run was started and when the Catchup Phase is first initiated, the data store may have taken additional real-time writes that were not processed during the Batch Build step. The real-time writes can refer to any writes that have been received in real-time and have generated QuickProcessed inputs. Once the Batch Build step is completed, thereby creating a newly batch built dataset, the Catchup Phase may update the newly batch built dataset, maintained in the indexed data stores 510, 520, or a Diff API to Download partners 500, based on these new real-time writes, so that the newly batch built dataset becomes up to date with the additional real-time writes.

FIG. 3 illustrates the Catchup process in accordance with some embodiments. To accomplish Catchup, the Quick Processed Inputs 810 from the prior version of the dataset can be each copied into the new Quick Processed Inputs 820, based on whether the timestamps of those inputs are after the timestamp at which the batch run was initiated. Specifically, each input in Quick Processed Inputs 810 can be added to the new Quick Processed Inputs 820 with the same entity_id (if it exists) for the new QuickProcessedInput table. If the same entity_id doesn't exist in the new QuickProcessedInput table, the process can create a brand new input set for that entity_id. For each entity_id with an additional input, re-summarization 830 can be performed for all views. If the generated summaries are different from the summaries in FullProcessedSummaries, a diff is written to the DiffTable 840. The materialization 880 is in turn updated by any new Diffs 840.
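The Catchup steps can be sketched as a single function. Everything named here is an assumption for illustration: inputs are dictionaries with a numeric `timestamp` field, tables are dictionaries keyed by entity_id, a single view is shown for brevity, and `summarize` is a caller-supplied function that receives the batch-built summary and the copied real-time inputs.

```python
def catch_up(old_quick_inputs, new_quick_inputs, full_summaries, batch_start, summarize):
    """Copy post-batch-start inputs forward, re-summarize, and emit diffs."""
    touched = set()
    for entity_id, inputs in old_quick_inputs.items():
        for record in inputs:
            # Only writes received after the batch run began were missed.
            if record["timestamp"] > batch_start:
                # setdefault creates a brand new input set when the
                # entity_id is not yet present in the new table.
                new_quick_inputs.setdefault(entity_id, []).append(record)
                touched.add(entity_id)

    diff_table = []
    for entity_id in sorted(touched):
        prior = full_summaries.get(entity_id)
        summary = summarize(prior, new_quick_inputs[entity_id])
        if summary != prior:
            diff_table.append({"entity_id": entity_id, "old": prior, "new": summary})
    return diff_table


def latest_name(prior_summary, quick_inputs):
    """Example summarizer (hypothetical): the newest input's name wins."""
    newest = max(quick_inputs, key=lambda record: record["timestamp"])
    return {"name": newest["name"]}
```

The returned diff records are what would then be applied to the materialization, so that the newly built dataset reflects the real-time writes that arrived during the batch run.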

New Materialization Data Store Notification

The final step for making a batch built dataset into a production-ready dataset can include a process for enabling the FullProcessedInputs, FullProcessedSummaries, QuickProcessedInputs, and QuickProcessedSummaries tables. A flag can be cleared and the Summary Materialization versions can be updated to point to the newly built ones. This process can change the pointer from previous versions of 510 and 520 to the newest versions of the materializations built by the latest Batch Build.

The Unprocessed Inputs provided by the Real-Time Workflow at the Pre Batch Build step can be copied into the Unprocessed Inputs 350, so that they can be processed by the next Batch Data Build. The Unprocessed Inputs 350 can be deduplicated to prevent duplicate entries.
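One simple way to picture the deduplication step is shown below. It assumes that two inputs are duplicates only when every field matches and that inputs are JSON-serializable dictionaries; the disclosure does not state the duplicate criterion, so this is a sketch under those assumptions.

```python
import json


def deduplicate(inputs):
    """Drop exact duplicate inputs while preserving first-seen order."""
    seen = set()
    unique = []
    for record in inputs:
        # A canonical serialization makes key order irrelevant.
        fingerprint = json.dumps(record, sort_keys=True)
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique
```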

After these steps, all updates to the data can be handled by the Real-Time Data Processing workflow, until the next Scheduled Batch Data Build.

Embodiments of the disclosed system can be used in a variety of applications. For example, embodiments of the disclosed system can be used to gather and summarize data from various application domains, such as social networking, online advertisements, search engines, medical services, media services, consumer package goods, video games, support groups, or any other application domains from which a large amount of data is generated and maintained.

Executable Code Embodiments

Embodiments of the disclosed system may be built upon logic or modules comprising executable code. The executable code can be stored on one or more memory devices. Accordingly, a logic does not have to be located on a particular device. In addition, a logic or a module can be multiple executable codes located on one or more devices in the systems disclosed herein. For instance, access logic responsive to an input for accessing and retrieving data stored in one or more cells in the data store can be one executable code on an application server. In alternative embodiments, such access logic is found on one or more application servers. In still other embodiments, such access logic is found on one or more application servers and other devices in the system, including, but not limited to, “gateway” summary data servers and back-end data servers. The other logics disclosed herein also can be one or more executable code located on one or more devices within a collaborative data system.

In certain embodiments, the disclosed systems comprise one or more application servers, as well as one or more summary data servers, and one or more back-end data servers. The servers comprise memory to store the logics disclosed herein. In particular embodiments, the one or more application servers store the logics necessary to perform the tasks disclosed herein. In other embodiments, the summary servers store the logics necessary to perform the tasks disclosed herein. In other embodiments, the back-end servers store the logics necessary to perform the tasks disclosed herein.

In certain embodiments, the client web browser makes requests to the one or more application servers. Alternatively, the disclosed systems comprise one or more summary or back-end data servers to which the client web browser makes requests.

In an exemplary embodiment, the one or more application servers receive requests from the client web browser for specific data or tables. Upon these requests, the one or more application servers call upon one or more data store servers to request summary or detail data from cells or tables. The one or more application servers also call upon the one or more data store servers when a request to submit new data inputs is made. The one or more application servers receive the data from the one or more summary servers and the one or more application servers generate HTML and JavaScript objects to pass back to the client web browser. Alternatively, the one or more application servers generate XML or JSON to pass objects through an API.

In one embodiment, the data store servers are based on an architecture involving a cluster of summary data servers and a cluster of back-end data servers. Note, however, that a system could include a single summary server and back-end data server. In this embodiment, the array of summary data servers is utilized to request, from back-end data servers, summary data and attributes of such summarized data points (confidence, counts, etc.). The array of summary servers also caches such summary data and summary attributes so that such summary data can be accessed faster, without the need for an additional request to the back-end data server.
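The caching role of a summary server can be sketched as a read-through cache in front of a back-end lookup. The class name `SummaryServer` and the callable `fetch_from_backend` are hypothetical, and the sketch omits cache invalidation, which a real deployment would need when summaries change.

```python
class SummaryServer:
    """Toy read-through cache in front of a back-end data server."""

    def __init__(self, fetch_from_backend):
        # fetch_from_backend is a hypothetical callable that returns the
        # summary data and attributes (confidence, counts, etc.) for an id.
        self._fetch = fetch_from_backend
        self._cache = {}

    def get_summary(self, entity_id):
        """Serve from the cache, asking the back end only on a miss."""
        if entity_id not in self._cache:
            self._cache[entity_id] = self._fetch(entity_id)
        return self._cache[entity_id]
```

Repeated requests for the same entity are answered from the cache, so the back-end data server is contacted once per entity in this simplified form.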

The present systems and processes rely on executable code (i.e., logic) stored on memory devices. Memory devices capable of storing logic are known in the art. Memory devices include storage media such as computer hard disks, redundant array of inexpensive disks (“RAID”), random access memory (“RAM”), and optical disk drives. Examples of generic memory devices are well known in the art (e.g., U.S. Pat. No. 7,552,368, describing conventional semiconductor memory devices and such disclosure being herein incorporated by reference).

Other embodiments are within the scope and spirit of the disclosed subject matter.

The subject matter described herein can be implemented in digital electronic circuitry, or in computer software, firmware, or hardware, including the structural means disclosed in this specification and structural equivalents thereof, or in combinations of them. The subject matter described herein can be implemented as one or more computer program products, such as one or more computer programs tangibly embodied in an information carrier (e.g., in a machine-readable storage device), or embodied in a propagated signal, for execution by, or to control the operation of, data processing apparatus (e.g., a programmable processor, a computer, or multiple computers). A computer program (also known as a program, software, software application, or code) can be written in any form of programming language, including compiled or interpreted languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment. A computer program does not necessarily correspond to a file. A program can be stored in a portion of a file that holds other programs or data, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code). A computer program can be deployed to be executed on one computer or on multiple computers at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification, including the method steps of the subject matter described herein, can be performed by one or more programmable processors executing one or more computer programs to perform functions of the subject matter described herein by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus of the subject matter described herein can be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).

Processors suitable for the execution of a computer program include, by way of example, both general and special purpose microprocessors, and any one or more processors of any kind of digital computer. Generally, a processor will receive instructions and data from a read-only memory or a random access memory or both. The essential elements of a computer are a processor for executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. Information carriers suitable for embodying computer program instructions and data include all forms of non-volatile memory, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices); magnetic disks (e.g., internal hard disks or removable disks); magneto-optical disks; and optical disks (e.g., CD and DVD disks). The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.

To provide for interaction with a user, the subject matter described herein can be implemented on a computer having a display device, e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor, for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball), by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well. For example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form, including acoustic, speech, or tactile input.

The techniques described herein can be implemented using one or more modules. As used herein, the term “module” refers to computing software, firmware, hardware, and/or various combinations thereof. At a minimum, however, modules are not to be interpreted as software that is not implemented on hardware, firmware, or recorded on a non-transitory processor readable recordable storage medium. Indeed “module” is to be interpreted to include at least some physical, non-transitory hardware such as a part of a processor or computer. Two different modules can share the same physical hardware (e.g., two different modules can use the same processor and network interface). The modules described herein can be combined, integrated, separated, and/or duplicated to support various applications. Also, a function described herein as being performed at a particular module can be performed at one or more other modules and/or by one or more other devices instead of or in addition to the function performed at the particular module. Further, the modules can be implemented across multiple devices and/or other components local or remote to one another. Additionally, the modules can be moved from one device and added to another device, and/or can be included in both devices.

The subject matter described herein can be implemented in a computing system that includes a back-end component (e.g., a data server), a middleware component (e.g., an application server), or a front-end component (e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described herein), or any combination of such back-end, middleware, and front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network (“LAN”) and a wide area network (“WAN”), e.g., the Internet.

The terms “a” or “an,” as used herein throughout the present application, can be defined as one or more than one. Also, the use of introductory phrases such as “at least one” and “one or more” should not be construed to imply that the introduction of another element by the indefinite articles “a” or “an” limits the corresponding element to only one such element. The same holds true for the use of definite articles.

It is to be understood that the disclosed subject matter is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The disclosed subject matter is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting.

As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods, and systems for carrying out the several purposes of the disclosed subject matter. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the disclosed subject matter.

Although the disclosed subject matter has been described and illustrated in the foregoing exemplary embodiments, it is understood that the present disclosure has been made only by way of example, and that numerous changes in the details of implementation of the disclosed subject matter may be made without departing from the spirit and scope of the disclosed subject matter.

We claim: 1.-23. (canceled)
 24. A method comprising: generating a first summary data using a set of data, the first summary data includes a first entity identifier and a first value associated with the first entity identifier; generating a second summary data using the first set of data and a second set of data, the second summary data includes a second entity identifier and a second value associated with the second entity identifier; determining a difference between the first summary data and the second summary data; and updating the first summary data based upon the difference between the first summary data and the second summary data.
 25. The method of claim 24, wherein the first set of data comprises bulk data input.
 26. The method of claim 25, wherein the bulk data input comprises one or more of: raw information received from one or more contributors; web-crawler data received from a web-crawler; or data received from a storage center.
 27. The method of claim 25, wherein the second set of data comprises intermittent data.
 28. The method of claim 27, wherein the intermittent data comprises real-time data submissions.
 29. The method of claim 27, further comprising: formatting the bulk data input into structured data; group a plurality of elements in the structured data; and generate an entity identifier for the plurality of elements.
 30. The method of claim 24, wherein the first summary data comprises a first entity identifier and the second summary data comprises a second entity identifier, and wherein determining the difference between the first summary data and the second summary data comprises: determining whether a first value of the first entity identifier and a second value of the second entity identifier are equal; and when the first value and second value are equal, comparing data associated with the first entity identifier and the second entity identifier.
 31. A non-transitory computer-readable storage medium comprising computer-executable instructions that, when executed by at least one processor, cause the at least one processor to perform a method comprising: generating a first summary data using a set of data, the first summary data includes a first entity identifier and a first value associated with the first entity identifier; generating a second summary data using the first set of data and a second set of data, the second summary data includes a second entity identifier and a second value associated with the second entity identifier; determining a difference between the first summary data and the second summary data; and updating the first summary data based upon the difference between the first summary data and the second summary data.
 32. The non-transitory computer-readable storage medium of claim 31, wherein the method further comprises: generate a third data set by combining the first data set and the second data set; and generate third summary data for the third data set.
 33. The non-transitory computer-readable storage medium of claim 31, wherein the first data set comprises one or more of: raw information received from one or more contributors; web-crawler data received from a web-crawler; or data received from a storage center.
 34. The non-transitory computer-readable storage medium of claim 33, wherein the second data set comprises data received from a user to correct information in the first data set.
 35. The non-transitory computer-readable storage medium of claim 34, wherein the first summary data comprises a first entity identifier and the second summary data comprises a second entity identifier, and wherein determining the difference between the first summary data and the second summary data comprises: determining whether a first value of the first entity identifier and a second value of the second entity identifier are equal; and when the first value and second value are equal, comparing data associated with the first entity identifier and the second entity identifier.
 36. The non-transitory computer-readable storage medium of claim 31, wherein the method further comprises processing the first data set to generate a first structured data set.
 37. The non-transitory computer-readable storage medium of claim 36, wherein the second data set comprises real-time data submissions, and wherein the method further comprises processing the second data set to generate a second structured data set in response to receiving the real-time data submissions.
 38. A system comprising: at least one processor; and memory encoding computer-executable instructions that, when executed by the at least one processor, perform a method comprising: generating a first summary data using a set of data, the first summary data includes a first entity identifier and a first value associated with the first entity identifier; generating a second summary data using the first set of data and a second set of data, the second summary data includes a second entity identifier and a second value associated with the second entity identifier; determining a difference between the first summary data and the second summary data; and updating the first summary data based upon the difference between the first summary data and the second summary data.
 39. The system of claim 38, wherein the first set of data comprises bulk data input.
 40. The system of claim 39, wherein the bulk data input comprises one or more of: raw information received from one or more contributors; web-crawler data received from a web-crawler; or data received from a storage center.
 41. The system of claim 39, wherein the second set of data comprises intermittent data.
 42. The system of claim 41, wherein the intermittent data comprises real-time data submissions.
 43. The system of claim 41, further comprising: formatting the bulk data input into structured data; group a plurality of elements in the structured data; and generate an entity identifier for the plurality of elements.
 44. The system of claim 24, wherein the first summary data comprises a first entity identifier and the second summary data comprises a second entity identifier, and wherein determining the difference between the first summary data and the second summary data comprises: determining whether a first value of the first entity identifier and a second value of the second entity identifier are equal; and when the first value and second value are equal, comparing data associated with the first entity identifier and the second entity identifier.